HTML API: Add normalization functions. #7331

dmsnell · 2024-09-11T16:38:52Z

Trac ticket: Core-62036.

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

westonruter

Nice: force_balance_tags(): The Next Generation

westonruter · 2024-09-11T17:32:26Z

src/wp-includes/html-api/class-wp-html-processor.php

+	 */
+	public function serialize(): ?string {
+		if ( WP_HTML_Tag_Processor::STATE_READY !== $this->parser_state ) {
+			return null;


Would this make sense to throw an exception?

I've tried to avoid throwing exceptions in user code. Tell me more about the value of potentially crashing vs. returning null?

Well, I guess to give more information about why it is returning null. Maybe _doing_it_wrong() then would be better? It would be helpful to get feedback in code for what is documented:

	 * This differs from {@see WP_HTML_Processor::normalize} in that it starts with
	 * a specific HTML Processor, which _must_ not have already started scanning;
	 * it must be in the initial ready state and will be in the completed state once
	 * serialization is complete.

ah okay I see now. another thought I had was resetting to the beginning, parsing, and returning to the previous location, which involves double-parsing if already mid-way through a document.

I've called wp_trigger_error() in these cases.
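The trade-off discussed in this thread, returning null while still surfacing a reason, could look roughly like the sketch below. Everything in it is a stand-in: `sketch_serialize()`, its `$state` argument, and the warning text are hypothetical, and plain `trigger_error()` is used in place of `wp_trigger_error()` so the snippet runs outside WordPress.

```php
<?php
/**
 * Stand-in for a serializer that refuses to run unless its state is "ready".
 * Hypothetical: the real method checks the processor's parser state and
 * reports through wp_trigger_error().
 */
function sketch_serialize( string $state, string $html ): ?string {
	if ( 'ready' !== $state ) {
		// Report why null is coming back, without throwing.
		trigger_error( 'Cannot serialize: scanning has already started.', E_USER_WARNING );
		return null;
	}

	return $html;
}

// Collect warnings instead of printing them.
$warnings = array();
set_error_handler(
	static function ( int $level, string $message ) use ( &$warnings ): bool {
		$warnings[] = $message;
		return true;
	},
	E_USER_WARNING
);

$result = sketch_serialize( 'complete', '<p>Hi</p>' );
restore_error_handler();

var_dump( $result );            // NULL
var_dump( count( $warnings ) ); // int(1)
```

The caller still only has to handle `?string`, while the warning gives a developer something to look at when debugging.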

westonruter · 2024-09-11T17:40:24Z

src/wp-includes/html-api/class-wp-html-processor.php

+		while ( $this->next_token() ) {
+			$token_type = $this->get_token_type();
+
+			switch ( $token_type ) {


What about processing instructions? Shouldn't they get a special treatment?

For example, <html><body><?php foo(); ?> is interpreted as a comment node rather than a processing instruction.

Seems like it should get serialized back in the same way? Maybe not, since the browser serializes this as <!--?php foo(); ?-->. But maybe that should be an option?

ah good catch: it should also serialize the PI node tag name, which would match what you wrote. looks like this needs a review of all of the invalid comment syntax

these should all be updated now. if something lingers I'd like to fix it, but ultimately if we botch an invalid comment, I'm guessing it's not the end of the world.

these will go into test cases.

westonruter · 2024-09-11T17:45:28Z

src/wp-includes/html-api/class-wp-html-processor.php

+			}
+
+			if ( ! $in_html && $this->has_self_closing_flag() ) {
+				$html .= '/';


While not required, it seems a space is usually added here in the wild, right? (e.g. Prettier does this)

Suggested change

-				$html .= '/';
+				$html .= ' /';

a good thought. with double-quoted attributes it's not relevant, but with unquoted attribute values it becomes relevant. we don't need that since we control quoting. maybe it's best to add it in anyway for the sake of other tools.
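To make the unquoted-attribute concern concrete, here is a minimal sketch of how an HTML tokenizer reads an unquoted attribute value: it stops at whitespace or `>`, so a solidus directly after the value is swallowed into it. The helper `sketch_read_unquoted_value()` is hypothetical and is not part of the HTML API.

```php
<?php
/**
 * Reads an unquoted attribute value starting at $value_start.
 * Illustrative only: a "/" does not end an unquoted value,
 * only whitespace or ">" does.
 */
function sketch_read_unquoted_value( string $html, int $value_start ): string {
	$length = strcspn( $html, " \t\f\r\n>", $value_start );

	return substr( $html, $value_start, $length );
}

$without_space = '<br class=wide/>';
$with_space    = '<br class=wide />';

// In both strings the value starts right after "class=".
$start = strlen( '<br class=' );

var_dump( sketch_read_unquoted_value( $without_space, $start ) ); // string(5) "wide/"
var_dump( sketch_read_unquoted_value( $with_space, $start ) );    // string(4) "wide"
```

With the space in place, the value ends before the solidus, which is why emitting ` /` is the safer choice if later code might rewrite attributes without quotes.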

westonruter · 2024-09-11T18:03:45Z

src/wp-includes/html-api/class-wp-html-processor.php

+		if ( null !== $this->get_last_error() ) {
+			return null;
+		}


Similarly here, it would be helpful to know why it returned null.

this is a tougher question because it would conflate the basic ?string return value. practically I think this can only occur if the HTML is unsupported (in which case we really shouldn't return any processed string) or we've run out of bookmarks (which should be unrealistically rare - and that reminds me, I found 2500 bookmarks sufficient to parse everything in my set of ~300k HTML documents, and I intend on upping the default value to support that for 6.7).

suppose you know why this failed: what would you do in response?

It could also use _doing_it_wrong() here too to communicate that information, I suppose. Or rather, wp_trigger_error() would be the more relevant function. If I knew why it failed then I wouldn't have to figure out why it failed. True it probably wouldn't impact the result in the end, but for debugging it would be useful.

Looking at where last_error is set, it seems to always coincide with throwing an exception anyway. So in practice would this if statement ever be entered?

the exceptions thrown internally are caught and shut down parsing, but do not crash. unsupported content exceptions are returned via get_unsupported_exception()

I've called wp_trigger_error() in these cases.

Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>

If code later in the processing pipeline adds unquoted attributes and doesn't add the requisite space following that, then another parser might find that the solidus is part of the attribute value instead of serving as a self-closing flag.

sirreal

This is pretty exciting, I'd like to start adding tests for it.

I just added it to the html api debugger when supported.

I'd love to start adding tests for this. One good test will be idempotency, where after an initial normalization, subsequent normalizations will be identical.
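An idempotency check of the kind described here might be sketched as follows. The helper `sketch_is_idempotent()` is hypothetical; it accepts any callable that takes a string and returns `?string`. Inside WordPress the callable could be `array( 'WP_HTML_Processor', 'normalize' )`, assuming that method has that shape; a stand-in normalizer is used so the snippet runs on its own.

```php
<?php
/**
 * Checks that normalizing twice gives the same result as normalizing once.
 */
function sketch_is_idempotent( callable $normalize, string $html ): bool {
	$once = $normalize( $html );
	if ( null === $once ) {
		// Nothing to compare when the normalizer gives up on the input.
		return false;
	}

	return $normalize( $once ) === $once;
}

// Stand-in normalizer so the sketch runs without WordPress.
$stand_in = static function ( string $html ): ?string {
	return strtolower( trim( $html ) );
};

var_dump( sketch_is_idempotent( $stand_in, '  <P>Text  ' ) ); // bool(true)
```

Treating a null result as a failure is a choice made for this sketch; a real test suite might prefer to skip unsupported inputs instead.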

This mentions null bytes here specifically:

Text will be re-encoded, null bytes handled, and invalid UTF-8 replaced with U+FFFD.

I think that's working correctly in text. Should it also be handled in tag names, attribute names, and attribute values?

Input (null bytes replaced for clarity)

<div␀-nb nb-att-␀-="nb-val-␀-">

Normalized output:

<div␀-nb nb-att-␀-="nb-val-␀-"></div␀-nb>

Expected:

<div�-nb nb-att-�-="nb-val-�-"></div�-nb>
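The expected transformation amounts to swapping each NULL byte for U+FFFD. A minimal sketch, assuming a plain string replacement is enough for illustration (the helper name is hypothetical, and the patch may treat each kind of token differently):

```php
<?php
/**
 * Replaces NULL bytes with U+FFFD REPLACEMENT CHARACTER.
 * Hypothetical helper, not the PR's implementation.
 */
function sketch_replace_null_bytes( string $text ): string {
	return str_replace( "\x00", "\u{FFFD}", $text );
}

$tag_name  = "div\x00-nb";
$attribute = "nb-val-\x00-";

var_dump( sketch_replace_null_bytes( $tag_name ) === "div\u{FFFD}-nb" );      // bool(true)
var_dump( sketch_replace_null_bytes( $attribute ) === "nb-val-\u{FFFD}-" );  // bool(true)
```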

sirreal · 2024-09-12T14:05:48Z

There are some known issues from HTML5lib tests similar to the PI problems mentioned here: #7331 (comment)

wordpress-develop/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php

Lines 28 to 30 in 4712210

    'comments01/line0155'    => 'Unimplemented: Need to access raw comment text on non-normative comments.',
    'comments01/line0169'    => 'Unimplemented: Need to access raw comment text on non-normative comments.',
    'html5test-com/line0129' => 'Unimplemented: Need to access raw comment text on non-normative comments.',

There's no good way to read the comment under some circumstances and something like a get_raw_comment_content() method would be helpful.

wordpress-develop/tests/phpunit/data/html5lib-tests/tree-construction/comments01.dat

Line 156 in 4712210

<?xml version="1.0">Hi

wordpress-develop/tests/phpunit/data/html5lib-tests/tree-construction/comments01.dat

Line 170 in 4712210

<?xml version="1.0">

wordpress-develop/tests/phpunit/data/html5lib-tests/tree-construction/html5test-com.dat

Line 130 in 4712210

<?import namespace="foo" implementation="#bar">

None of these satisfies the PI constraint (each is missing the ? before the > closer), so they're treated as invalid HTML comments. The initial ? isn't accessible through get_modifiable_text(), and modifying that character could change the token to something completely different.

There are a couple of cases like this, I think they're all <? or <!-started strings triggering the bogus comment state.

CORRECTION:

I've edited it, it was initially incorrect. The bogus comments starting with <! do ignore the ! in their contents. Only <? seem to be mishandled.

| Input | Expected | Actual | Correct |
|---|---|---|---|
| `<?xml foo >` | `<!--?xml foo -->` | `<!--xml foo -->` | ⛔️ |
| `<!>` | `<!---->` | `<!---->` | ✅ |
| `<! more stuff >` | `<!-- more stuff -->` | `<!-- more stuff -->` | ✅ |
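The "Expected" column can be reproduced with a small sketch that keeps the leading `?` as comment data but drops a leading `!`. The helper `sketch_normalize_bogus_comment()` is hypothetical; it only covers the simple shapes in the table and ignores real comments, DOCTYPE declarations, and CDATA sections.

```php
<?php
/**
 * Converts the raw source of a simple bogus comment into a normative comment.
 * Returns null for input that is not one of the shapes handled here.
 */
function sketch_normalize_bogus_comment( string $raw ): ?string {
	if ( strlen( $raw ) < 3 || '<' !== $raw[0] || '>' !== substr( $raw, -1 ) ) {
		return null;
	}

	if ( '?' === $raw[1] ) {
		// The "?" is part of the comment data, so keep it.
		$data = (string) substr( $raw, 1, -1 );
	} elseif ( '!' === $raw[1] ) {
		// The "!" is not part of the comment data, so skip it.
		$data = (string) substr( $raw, 2, -1 );
	} else {
		return null;
	}

	return "<!--{$data}-->";
}

echo sketch_normalize_bogus_comment( '<?xml foo >' ), "\n";     // <!--?xml foo -->
echo sketch_normalize_bogus_comment( '<!>' ), "\n";             // <!---->
echo sketch_normalize_bogus_comment( '<! more stuff >' ), "\n"; // <!-- more stuff -->
```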

sirreal · 2024-09-12T17:22:32Z

We'd talked about a method to really inspect different types of comment text content. I've proposed a method in #7342. That would be helpful here.

sirreal

This method is nice and it seems like it's in a good place. I'm happy to see it getting tests.

I'd like it if null bytes were normalized in more places (tag names, attribute names and values) before this lands.

HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all of these inputs, but downline parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets.

This patch introduces the `serialize()` method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A `normalize()` static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing.

Subclasses relying on the `serialize_token()` method may perform structural HTML modifications with as much security as the upcoming `\Dom\HTMLDocument()` parser will, though these are not able to provide the full safety that will eventually appear with `set_inner_html()`.

Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text).

Developed in #7331
Discussed in https://core.trac.wordpress.org/ticket/62036

Props dmsnell, jonsurrell, westonruter.
Fixes #62036.

git-svn-id: https://develop.svn.wordpress.org/trunk@59076 602fd350-edb4-49c9-b593-d223f7449a82

dmsnell · 2024-09-20T23:05:32Z

Merged in [59076]
03b12dc

dmsnell added 2 commits September 11, 2024 09:37

Lower-case names when calling for qualified tag name.

07e89bb

dmsnell added 2 commits September 11, 2024 09:47

Adjust some text encoding.

754118e

Remove inner text from elements which cannot contain it.

9c6e34c

westonruter reviewed Sep 11, 2024

dmsnell and others added 7 commits September 11, 2024 11:11

Lower-case inside of the serialization function.

7f5783e

Progressive tense for function summaries.

2aa89fa

Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>

Expand support for bogus comments.

00a5773

Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>

Whitespace

99799b7

Update docs examples.

db24c11

Raise WP error when failing.

7b8aa53

Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>

westonruter reviewed Sep 11, 2024

dmsnell mentioned this pull request Sep 11, 2024

HTML API: Normalize with constraints. dmsnell/wordpress-develop#20

Open

Change error level to warning.

18b5005

Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>

dmsnell mentioned this pull request Sep 11, 2024

Add an XML serializer. dmsnell/wordpress-develop#21

Closed

sirreal reviewed Sep 12, 2024

This was referenced Sep 12, 2024

WIP: HTML API: Add set_inner_html() to HTML Processor #7326

Draft

HTML API: Add get full comment text method #7342

Closed

dmsnell added 2 commits September 20, 2024 11:57

WIP: Move actual token serialization into a single-unit method.

47f7f08

Add basic unit test suite.

92558b9

dmsnell force-pushed the html-api/normalize-html branch from c5f1924 to 92558b9 Compare September 20, 2024 18:57

dmsnell mentioned this pull request Sep 20, 2024

HTML API: Plans for 6.7 WordPress/gutenberg#60396

Open

19 tasks

sirreal approved these changes Sep 20, 2024

Merge branch 'trunk' into html-api/normalize-html

6167887

dmsnell added 7 commits September 20, 2024 13:40

Call the new get_full_comment_text()

95756e5

Make the kwality 110% butter

fca6481

Fix wrong test setup.

b37b312

More tests and NULL byte transformation.

dd4ff16

Fix text alignment issue in comment.

588020f

Fix text alignment in docs.

0abf367

Expand docblock for serialize_token

37b7fe8

dmsnell closed this Sep 20, 2024

dmsnell mentioned this pull request Sep 21, 2024

WIP: HTML API: Add an XML serializer. #7408

Draft
